Figure 1: Conceptual overview of the architecture for the lookup method 



1.0 Overview 

One of the limiting factors in future high speed IP routers is the problem of performing the 
longest prefix match to decide out of which port the incoming packet should be routed. The 
longest prefix match involves looking at the destination address on the incoming packet and 
finding the longest prefix in the routing table that matches it. As the rate at which data packets 
can be transmitted over the optical fibers is increasing, the number of lookups that need to be 
done to keep up with this speed is also increasing rapidly. Here we propose a hardware solution in 
the form of a CAM (Content Addressable Memory) based ASIC (Application Specific Integrated 
Circuit) that allows lookups to keep up with transmission speeds over optical fibers for the next 
several years. The advantages of this method are: 
Discussion of Prior Art. 

One of the most important applications for this invention is to perform address lookups in 
routers. A router is an electronic device that has several input and output ports. It receives data 
packets with destination addresses on each of these ports. The lookup function involves looking 
up these destination addresses in a table called a lookup table to determine which output port 
a particular data packet should be sent to, so that it gets to its destination most efficiently. 
The router then sends the packet out of the output port determined by this lookup. 
One method of doing this is to build a list of all the possible destinations and the best port to 
send a packet out of to reach each destination. Then each destination address can be searched in 
this table and the best outgoing port can be determined. However, in large networks like the 
internet the number of such destinations is so large that such a table becomes too large to be 
implemented in each router. Another consideration is the maintenance of this table. Each 
time a new destination is added to the network, each router in the network has to be informed of 
this. This is very cumbersome for large networks. 

Hence, to solve this problem, a prefix based lookup scheme is used to carry out routing in 
modern internet routers. The idea here is that the network is arranged in a hierarchical fashion 
and the addresses are allocated accordingly, somewhat similar to a postal address. For example 
take an imaginary postal address like 123, Some Street, Someville, Somestate, US. The zip code 
has been dropped to make the postal example more analogous. Thus, a letter with this address 
posted from anywhere in the world would first be sent to US. Next, the US postal system will 
direct the letter to Somestate, from where it will go to the city Someville and so on and so forth. 
Thus, this system eliminates the requirement for every post office in the world to have knowledge 
of where 123, Some Street is and how to deliver a letter to that address. Similarly prefixes allow 
the aggregation of entire sub-networks under one entry in the lookup table. 
However, there are special cases that need to be taken care of. Again falling back on the postal 
system analogy, from some places in Canada it is more efficient to send a letter addressed to 
Alaska directly there rather than sending it first to the mainland US postal system. Thus, these 
Canadian post offices would have a letter routing rule book that has two entries: send letters 
addressed to US to the mainland US postal system, and send letters addressed to Alaska, US to the 
Alaska postal system. Here clearly the second rule has higher precedence over the first one. This 
can be expressed as longest prefix matching. Thus, one should use the rule with the longest or most 
specific prefix for routing decisions. Similarly, in routers the longer prefix has a higher priority 
than a shorter prefix. This is the basic concept behind CIDR (Classless Inter-Domain Routing) 
which is used in routers. 
Even though this concept cuts down on the number of entries that need to be maintained in the 
routing table, the number of entries in the routing tables of routers in the backbone 
of the internet is still large, at around 100,000 today. To provide adequate margin for growth 
during the lifetime of these routers, routers are currently shipped with the ability to support one 
million entries. Today these addresses are 32 bits long (under a scheme called IPv4), but as the 
stock of available addresses is depleted, 128 bit long addresses (IPv6) are coming into use. 
Another factor that is making this task difficult is that the speed of the links connecting these 
routers is growing with rapid advances in technology. The state of the art optical fiber links 
today can run at 10 Gbps (called OC-192). Considering that minimum sized (40 byte) data 
packets are sent over links of this capacity, a lookup speed of slightly over 30 million lookups per 
second is required. Systems currently in development will support link speeds of 40 Gbps 
(OC-768), requiring a lookup speed of over 120 million lookups per second. This lookup speed is 
required for each link to a router. A router may have several links connected to it. Thus, overall 
the problem is to search for the longest prefix match for each address among a million prefixes 
at the speed of several hundred million lookups per second. Using just prior art this is a daunting 
problem. The parameters of interest are power consumption, the number of chips required to store 
and search the table and the chip area of these chips, the latency of the search, and the rate at 
which the search can be performed. 
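
The lookup rates quoted above follow from dividing the link rate by the size of a minimum sized packet. The short sketch below only illustrates that arithmetic; the function name is hypothetical and it assumes the worst case in which every packet is 40 bytes long and needs exactly one lookup.

```python
def lookups_per_second(link_bps: float, min_packet_bytes: int = 40) -> float:
    """Worst-case lookup rate, assuming back-to-back minimum sized packets."""
    return link_bps / (min_packet_bytes * 8)

# Link rates taken from the text: OC-192 (10 Gbps) and OC-768 (40 Gbps).
oc192_rate = lookups_per_second(10e9)
oc768_rate = lookups_per_second(40e9)

print(f"OC-192: {oc192_rate / 1e6:.2f} million lookups per second")
print(f"OC-768: {oc768_rate / 1e6:.2f} million lookups per second")
```

A router with several such links needs the sum of the per-link rates, which is how the requirement reaches several hundred million lookups per second.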

An example of a lookup table used for forwarding is shown in FIG. 1. Each entry in this table is 
32 bits wide. The first column contains the prefix with the prefix length after the '/'. Each 32 bit 
address is grouped into four decimal numbers each representing 8 bits. The four decimal 
numbers are separated by a decimal point. For example the 32 bit long address 1010 1011 0011 
0110 0010 0000 0001 0101 is 171.54.32.21 in this format. 
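
As an illustration of this notation, the following minimal sketch converts between a 32 bit integer and the dotted decimal format. The helper names are hypothetical and are not part of the invention.

```python
def to_dotted(addr: int) -> str:
    """Format a 32 bit address as four decimal numbers of 8 bits each."""
    return ".".join(str((addr >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def from_dotted(text: str) -> int:
    """Parse dotted decimal text back into a 32 bit integer."""
    value = 0
    for part in text.split("."):
        value = (value << 8) | int(part)
    return value

# The 32 bit address used as the example in the text.
example = 0b1010_1011_0011_0110_0010_0000_0001_0101
print(to_dotted(example))
```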

Using these conventions, the entry 171.54.32.0/24 refers to the range of addresses from 
171.54.32.0 to 171.54.32.255. Hence, the first 24 bits are defined while the last 8 bits are "don't 
care" bits. Another representation for the prefixes would be 171.54.32.X, where the X stands for 
"don't care". The outgoing port is in the next column. An incoming address can match multiple 
entries. In this case the entry with the longest prefix is chosen as the best match if a CIDR 
algorithm is desired. For example the word 171.54.32.123 matches two entries from the table in 
FIG. 1, namely 171.54.0.0/16 and 171.54.32.0/24. However, since 171.54.32.0/24 is a longer 
prefix than 171.54.0.0/16, the best match is 171.54.32.0/24. Another method of establishing 
priority would be to actually specify the priority for each entry. 
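
The matching rule just described can be sketched in software as a plain linear search. This is only a reference model of longest prefix matching, not the hardware method of this invention. The two prefixes come from the example above, while the port numbers and function names are assumptions made for illustration.

```python
def parse_prefix(text):
    """Split 'a.b.c.d/len' into a 32 bit integer and a prefix length."""
    addr_text, length_text = text.split("/")
    value = 0
    for part in addr_text.split("."):
        value = (value << 8) | int(part)
    return value, int(length_text)

def matches(addr, prefix, length):
    """Compare only the first `length` bits; the rest are don't care bits."""
    mask = ((1 << length) - 1) << (32 - length) if length else 0
    return (addr & mask) == (prefix & mask)

def longest_prefix_match(table, addr_text):
    """Return (length, prefix_text, port) of the best entry, or None."""
    addr, _ = parse_prefix(addr_text + "/32")
    best = None
    for prefix_text, port in table:
        prefix, length = parse_prefix(prefix_text)
        if matches(addr, prefix, length) and (best is None or length > best[0]):
            best = (length, prefix_text, port)
    return best

# Hypothetical ports 1 and 2; FIG. 1 defines the real ones.
table = [("171.54.0.0/16", 1), ("171.54.32.0/24", 2)]
best = longest_prefix_match(table, "171.54.32.123")
```

Every entry is examined on every lookup here, which is exactly the cost that the divisions described later are meant to avoid.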

An alternate way to represent this table is shown in FIG. 2. Here each prefix is represented as a 
range along a number line (shown at the bottom of the figure). Since we are dealing with 32 bit 
prefix entries in this example, this number line extends from 0.0.0.0 to 255.255.255.255. Each 
prefix is a contiguous range on this number line. The prefixes from the table in FIG. 1 are shown 
on this number line. Note that the longer prefixes represent shorter ranges on this number line. If 
a longest prefix match is desired, then the first range that matches the address to be looked up 
going from top to bottom in FIG. 2 is the best match. 
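
The range view of a prefix can be made concrete with a small sketch. The helper names are hypothetical; the code simply computes the lowest and highest address covered by a prefix of a given length.

```python
def from_dotted(text: str) -> int:
    """Parse dotted decimal text into a 32 bit integer."""
    value = 0
    for part in text.split("."):
        value = (value << 8) | int(part)
    return value

def prefix_range(prefix: int, length: int):
    """Return (low, high), the contiguous range a prefix covers on the number line."""
    span = 1 << (32 - length)               # number of addresses covered
    low = prefix & ~(span - 1) & 0xFFFFFFFF  # clear the don't care bits
    return low, low + span - 1

low24, high24 = prefix_range(from_dotted("171.54.32.0"), 24)
low16, high16 = prefix_range(from_dotted("171.54.0.0"), 16)
```

The /24 range holds 256 addresses and lies entirely inside the /16 range of 65,536 addresses, which is why the longer prefix appears as the shorter bracket.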

There are two general approaches to solving this problem. The first is to use a general CAM 
(Content Addressable Memory) to store and search the entire lookup table. Each CAM cell 
contains two memory elements to store three states (1, 0, X or don't care) and comparison 
circuitry to compare the destination IP address to the stored entry. This approach results in 
large silicon area as well as large power consumption, as every entry is searched. 
The second approach is to store the lookup table as some data structure in conventional memory. 
For example see U.S. patent 6,011,795. This data structure is designed to allow efficient lookup 
using a particular algorithm. A specially designed integrated circuit is used to perform the 
lookup on this memory. While the power in this scheme can be low, it suffers from several 
drawbacks. Any data structure involves a lot of wastage due to either empty entries or pointers 
used to navigate the structure. The ratio of memory used to real prefix data is 3-4 at best and 
can be as bad as 64. Secondly, to run this lookup at a high speed, each level of this data structure 
has to be pipelined. This puts a large I/O requirement on the system, which is difficult if not 
impossible to meet as the number of lookups required exceeds 100 million lookups per second. 
Hence current techniques are expensive and have unmanageable worst-case power 
and I/O requirements. Another disadvantage is that the latency of these solutions can be large 
and the worst-case latency may be much larger than the average case latency. This large 
and possibly uncertain search latency requires larger and more complex buffering of the data 
packets. 

Objects and Advantages 

Accordingly, to address the deficiencies of prior art, several objects and advantages of the 
present invention are: 

♦ It does not suffer from the high power requirements of usual CAM implementations, allowing 
the use of cheap packaging and higher density, reducing the chip count. Power does not 
scale with increasing table size, unlike conventional implementations. 

♦ It allows the use of a binary CAM structure in place of a ternary CAM (which can store don't 
cares), giving higher table entries per chip. 

♦ It has low latency, which is beneficial to applications like real time voice and video 
transmission. 

♦ It can support a high lookup rate, allowing the routing of a large amount of traffic. 

♦ It allows several chips to be operated in parallel with ease, to support large lookup table sizes, 
as there is no communication required between chips to decide the best match. 

Further objects and advantages are to have a solution which is easy to design. Still further 
objects and advantages will become apparent from a consideration of the ensuing description 
and drawings. 

This method is easily extendable to solving other problems which require looking an 
entry up in a table at a high rate. 
Brief Summary of the Invention 

This invention provides a method and a system, an ASIC (Application Specific Integrated Circuit) 
with several CAM arrays, to perform a single-dimensional prefix search on the prefixes stored in 
the said arrays such that as few as one CAM array is activated at a time. Each of these arrays is 
surrounded by special logic that activates only the necessary CAM array. 
Brief Description of the Several Views of the Drawing 
List of Figures 

FIG. 1 shows an example of a forwarding lookup table with one dimensional prefixes. 

FIG. 2 shows the representation of a one dimensional lookup table as ranges along a number 
line going from 0.0.0.0 to 255.255.255.255. 

FIG. 3 shows the concept of dividing the lookup table into different subgroups depending on the 
location along the number line. 

FIG. 4 Chip-level architecture of the preferred embodiment 

FIG. 5 Schematic of one possible implementation of comparing circuitry in one cell of the 
Content Comparing Memory 

FIG. 6 Schematic of another possible implementation of comparing circuitry in one cell of the 
Content Comparing Memory 

FIG. 7 Schematic of preferred embodiment of one complete cell of the Content Comparing 
Memory 

2.0 Architecture Description 

Detailed Description of the Invention 

The basic idea behind this invention is to divide the table of prefixes into smaller subgroups. This 
allows this invention to save on power and implementation area requirements as compared to 
prior art. To aid in understanding this invention, first the method of dividing a large table of 
prefixes into smaller subgroups will be described. Next the hardware to store, identify and 
search the correct subgroup will be described. 
Basic Theoretical Concept 

Each entry in the lookup table consists of a prefix of a certain length. For example, a 32 
bit address with a prefix length of 16 is 23.123.0.0/16. This prefix can be thought to represent a 
range of addresses from 23.123.0.0 to 23.123.255.255. In figure 1 each of these ranges is 
represented by a square bracket facing upwards. Entries of the same prefix length are placed on 
the same line. 

Searching entries takes power roughly proportional to the number of entries that need to 
be searched. This scheme saves power by searching only a few entries out of the entire table. 
The way the table is divided is shown in figure 1. Depending on the technology, chip size and 
table size, several chips may be required to save and search the entire table. Hence, the first 
division is between chips. Each chip contains entries from only a certain range of the address 
space. Entries that cross this boundary are put in both chips. Thus, depending on the range in 
which the address to be matched falls, only one chip needs to be searched for it. 

Within each chip the entries are divided into several packs. These packs will be referred 
to as banks. Each of these banks shares a mask entry, which stores the information on the 
significant bits in the prefix. This allows the entries to be stored in smaller binary CAMs instead 
of ternary CAMs, which are otherwise required. Since each bank contains entries of the same 
length, the entries cannot overlap with each other. Thus, each address will get at the most one 
match. This eliminates the need to have a priority encoder within each bank to resolve multiple 
matches. For these reasons the second division is based on prefix length. 

The third division is from bank to bank. Depending on the number of entries in each 
prefix length on each chip, several banks may be required to store these entries. Each bank 
contains entries contained in a particular address range. Each address lookup needs to only 
activate one of these banks per prefix length, further reducing the power requirement. A priority 
encoder is required between banks to determine which was the longest prefix match among the 
matches from different prefix lengths. 

Note that depending on the specific application, technology used and table size, the 
number, order or type of this division can be changed to obtain the optimal design. 
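
The three divisions described above can be summarized by a small software model. It is a sketch under stated assumptions rather than the hardware itself. The chip address ranges are fixed inputs here, the number of rows per bank is a free parameter, and the function and field names (`build_banks`, `min`, `last`) are hypothetical.

```python
from collections import defaultdict

def build_banks(entries, chip_bounds, rows_per_bank):
    """Divide (prefix, length) entries by chip, then prefix length, then bank."""
    chips = []
    for lo, hi in chip_bounds:                     # first division: chips
        by_length = defaultdict(list)
        for prefix, length in entries:
            last_addr = prefix + (1 << (32 - length)) - 1
            if prefix <= hi and last_addr >= lo:   # ranges crossing a boundary land in both chips
                by_length[length].append(prefix)
        banks = []
        for length in sorted(by_length):           # second division: prefix length
            prefixes = sorted(by_length[length])   # sorted between banks
            for i in range(0, len(prefixes), rows_per_bank):
                rows = prefixes[i:i + rows_per_bank]   # third division: banks
                banks.append({
                    "length": length,              # shared mask for the whole bank
                    "min": rows[0],                # smallest entry in the bank
                    "rows": rows,
                    "last": i + rows_per_bank >= len(prefixes),
                })
        chips.append({"lo": lo, "hi": hi, "banks": banks})
    return chips

entries = [(0x00000000, 0), (0x0A000000, 8), (0xAB360000, 16),
           (0xAB362000, 24), (0xAB362100, 24), (0xAB362200, 24)]
chips = build_banks(entries, [(0, 0x7FFFFFFF), (0x80000000, 0xFFFFFFFF)], 2)
```

In this toy table the /0 entry spans both chips and is therefore stored twice, while the three /24 entries need two banks in the second chip.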
Figure 2: Schematic of chip used in the implementation of the lookup method 

3.0 Functional Block Level Implementation 

This section deals with the functional block level implementation, while the details of 
circuit implementation are presented in the following sections. Figure 2 shows the schematic of 
the implementation. Each thick solid line represents a flipflop. Thus, the regions between 
flipflops of the same color lie in the same clock domain. The functioning of this schematic will 
be explained by going through a lookup (of an address) and an add/delete prefix cycle. 
Lookup: 

A particular interface is assumed here for the sake of discussion. In a cycle in which 
there is an address to be looked up, the address is put on the in_addr bus, the packet ID is put on 
the in_pkt_id bus and in_valid is asserted. Next this address has to go through the first check 
to find out if it is in the same range as the addresses in this chip. This is the search on the first 
division. This search is done by the use of CCM (Content Comparing Memory). In this 
implementation, without loss of generality, the CCM compares the incoming data to that in 
the memory and computes whether it is greater than or equal to the one in the memory. A possible 
implementation of the CCM is presented in the next sections. 

So, in the next cycle the incoming address is compared against two CCMs to check if it is 
in the right range. These CCMs contain the maximum and minimum of the range of addresses 
contained in that chip. Chips that do not have addresses in the right range do not have to do any 
further work on this address, saving power. The chip that does match now passes on the address 
to the CAM banks in the next cycle. 

Now, as mentioned before, each of these CAM banks contains entries with the same prefix 
length. This prefix length is encoded in the mask present in each bank. The data in the mask 
decides which bits of the incoming address will be compared with the entries in the bank. Each 
CAM bank also contains a CCM. This CCM stores the least possible address that 
will match the entries in the bank and compares it with the incoming address. If the incoming 
address is found to be greater than or equal to the data in the CCM but less than (i.e. not greater 
than or equal to) the one in the next bank which contains addresses of the same prefix length, 
then and only then the incoming address is passed to the rest of the CAM bank for comparison. 
This requires CAM banks with prefixes of the same length to be placed next to each other and the 
addresses to be sorted between the banks. Note that the addresses within a bank need not be sorted 
as only one match can be made for entries of the same prefix length. The last in a chain of CAM 
banks with the same prefix length should not compare the incoming address with the next CAM bank 
(as that contains prefixes of a different length). This is achieved by introducing the last bit. So 
for the last CAM bank in a chain (which has the last bit set) comparison is carried out only with 
one CCM. 
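
The bank selection rule described above can be modeled as follows. This is a behavioral sketch only. In hardware all banks evaluate in parallel, whereas the loop below visits them one at a time, and the field names `ccm` and `last` are hypothetical.

```python
def select_bank(addr, chain):
    """Pick the one bank in a chain of same-length banks that should be searched.

    chain: banks of one prefix length, ordered by their CCM value
    (the least address that can match an entry in that bank).
    """
    for i, bank in enumerate(chain):
        ge_this = addr >= bank["ccm"]               # this bank's CCM comparison
        if bank["last"]:
            ge_next = False                         # last bit set: ignore the next bank
        else:
            ge_next = addr >= chain[i + 1]["ccm"]   # neighbour's CCM comparison
        if ge_this and not ge_next:
            return i
    return None                                     # below the first bank: no bank fires

chain = [
    {"ccm": 100, "last": False},
    {"ccm": 200, "last": False},
    {"ccm": 300, "last": True},
]
```

Because the CCM values are sorted between banks, at most one bank per prefix length is activated for any incoming address.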

In the next cycle the comparison within each CAM bank that matched (at the most one 
per prefix length) is carried out. The circuit operation and design of these CAM cells is detailed 
in the following sections and hence will not be covered here. It is sufficient to say here that 
each row of CAM cells (which contains one entry) has an associated memory row (e.g. SRAM) 
containing the tag (which could be the port address that the packet needs to leave the router by). 
If a match is found between the incoming address and one of the entries in the bank, the 
corresponding tag is outputted and a hit line is asserted. 

In the next cycle the priority encoder decides which of the CAM banks has got the 
longest prefix match. Again, the workings of the priority encoder are explained in detail in the 
following sections. The priority encoder decides which CAM bank has the highest priority and 
lets it output its tag (which belongs to the longest prefix match) onto the out_port bus. 
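
The last two lookup cycles, the masked comparison inside each activated bank and the priority encoding between banks, can be sketched as below. The data layout and names are assumptions made for illustration, and the tags 1 and 2 are hypothetical port numbers.

```python
def bank_lookup(addr, bank):
    """Binary CAM search: mask the address, then look for an exact match."""
    length = bank["length"]
    mask = ((1 << length) - 1) << (32 - length)   # shared mask of the bank
    key = addr & mask
    for entry, tag in bank["rows"]:               # rows need not be sorted
        if entry == key:
            return tag                            # hit line asserted, tag driven out
    return None

def priority_encode(hits):
    """Choose the tag of the longest prefix among banks that reported a hit."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[0])[1]

def lookup(addr, active_banks):
    hits = []
    for bank in active_banks:
        tag = bank_lookup(addr, bank)
        if tag is not None:
            hits.append((bank["length"], tag))
    return priority_encode(hits)

active_banks = [
    {"length": 16, "rows": [(0xAB360000, 1)]},    # 171.54.0.0/16
    {"length": 24, "rows": [(0xAB362000, 2)]},    # 171.54.32.0/24
]
out_port = lookup(0xAB36207B, active_banks)       # 171.54.32.123
```

Since a bank holds entries of a single length, it can produce at most one hit, so the only arbitration needed is between banks of different prefix lengths.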
Update: 

This section shall detail how the data structure is maintained. A processor that maintains 
the update engine gives the update commands. To allow lookups to take place without being 
held up by updates, each update command maintains the data structure intact. This requires all 
the CCMs and CAMs at various levels to be updated in one pipelined operation (so as to leave 
the data structure ready to do a lookup in the next cycle). This means that each update is one 
clock cycle long and updates each section as it travels down the pipeline. The lookup operation 
can resume after the clock cycle in which the update is introduced to the pipeline. 

To add a new entry to a chip, the entry is placed on the in_addr bus, the 
corresponding tag is placed on the in_port bus and packet_update is asserted. The bank 
address that this update is directed to is put on the update_blk_addr bus, while the row number 
within this bank is put on the update_cam_addr bus. Now, this addition might change the data 
structure, so as to require the modification of the following CCMs: 

• Bank CCM: If the entry is the smallest in that bank, the CCM content has to be updated. The 
update_ccm signal is asserted, which ensures this. Note that the in_addr bus should contain the 
smallest address that matches the new entry. The mask in the CAM bank will ensure that the 
relevant bits are ignored during lookup. 

• Lo_CCM: This contains the lowest address that can get a match on this chip. Thus, if the 
incoming entry is the smallest in the chip, update_lo_ccm is asserted. Again the 
in_addr bus should contain the smallest address that matches the new entry. 

• Hi_CCM: This contains the highest address that can get a match on this chip. Thus, if the 
incoming entry is the largest in the chip, update_hi_ccm is asserted. In this case the 
in_addr bus should contain the largest address that matches the new entry. Note that this 
update will never require the concurrent updating of the bank CCM. So, putting the largest 
address on the in_addr bus will not cause a problem. 
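
The three cases above can be condensed into a small decision sketch for the update engine. It is an illustrative model under assumptions. The function `plan_add` and its arguments are hypothetical, and the current bank minimum and chip limits are taken to be known to the processor issuing the update.

```python
def plan_add(prefix, length, bank_min, chip_lo, chip_hi):
    """Decide which CCM update signals accompany an add, and what to drive on in_addr."""
    low = prefix                                   # smallest address matching the entry
    high = prefix + (1 << (32 - length)) - 1       # largest address matching the entry
    signals = {
        "update_ccm": low < bank_min,              # new smallest entry in its bank
        "update_lo_ccm": low < chip_lo,            # new smallest entry in the chip
        "update_hi_ccm": high > chip_hi,           # new largest entry in the chip
    }
    # Per the text, Hi_CCM wants the largest matching address on in_addr,
    # while the bank CCM and Lo_CCM want the smallest one.
    in_addr = high if signals["update_hi_ccm"] else low
    return signals, in_addr

# A /24 entry that becomes the smallest entry of its bank.
signals_a, addr_a = plan_add(0xAB362000, 24, bank_min=0xAB363000,
                             chip_lo=0x80000000, chip_hi=0xF0000000)

# A /24 entry that extends the top of the chip's address range.
signals_b, addr_b = plan_add(0xAB362000, 24, bank_min=0xAB360000,
                             chip_lo=0x80000000, chip_hi=0xAB361FFF)
```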

A delete is similar to an add, except that the entry is set to a special value that will never 
match a valid incoming address. In the design part we first came up with a compact circuit 
implementation of the CCM cells. Then we tried to optimize the critical path delay to obtain the 
maximum speed. 

4.0 Architecture and Floorplan 

One of the issues with the architecture we chose was the update. At first glance it seems 
that the update will be very time consuming and of the order of the number of entries. But we 
observed that although the IP entries should be sorted between the CAM banks, they do not have 
to be sorted inside each CAM bank. This reduces the number of operations needed for each 
update from O(N) (N = number of entries) to O(N/M), in which M is the number of rows in each 
CAM bank. In our architecture we assumed M~100, so the update operation is on the order of 
1000 operations. In rare cases when the boundary between prefix lengths should be moved 
across the CAM banks, the number of operations for update may increase. By assigning a sufficient 
number of CAM banks to each prefix length based on statistics we try to avoid these kinds of 
updates as much as possible. 

The other issue with our architecture is that we have to have extra bins in each chip to 
take care of the storage area which is wasted between prefix length boundaries. We need to have 
25 extra bins (number of different prefix lengths, from 8 to 32) per chip. By looking at the floor 
plan in fig. 3, the area penalty is only 10%. This area penalty will decrease as we go to larger 
chips and smaller feature size technologies (less than 2% for a 1 cm² chip size in 0.15 µm 
technology). Thus we obtain a very low power design, since for each lookup only one chip is 
activated and in each chip at most 25 CAM banks are fired. 

Figure 3 : Floor Plan of the Chip [FIGURE DROPPED] 

The floorplan as mentioned above is shown in fig. 3. Also the pipelining of the operations 
is shown in fig. 2. All latches with the same color represent one operation cycle. The latency for 
each chip (and for the whole lookup, since one chip per lookup is activated) is 6. The expansion 
of the system by adding chips is also very simple and is only limited by the number of address 
lines allocated for addressing the memory locations. 

For estimating the area, we used the formula given in the project handout. For CAM cells 
we assumed an area of 50λ×40λ, since we are using smaller transistors than the standard cell 
and we could stack the vias. For the SRAM cell we assumed an area of 25λ×50λ, since the SRAM 
height is the same as the CAM cell height. This way each CAM bank will have an area of 5.7 
Mλ². The logic for each bank requires 1.3 Mλ². Our Priority Encoders require 4 Mλ². Our 
decoders require 60 Mλ². So the total chip area will be 0.71 cm². Since the aspect ratio of our 
chip is 1.11:1, the chip will be 0.91cm×0.82cm. 

4.1 Functional testing and Verilog Status 

The objective behind writing the verilog code was to: 

♦ Test the soundness of the idea 

♦ Check for any potential architectural bottlenecks (like a large number of wires running all over 
the chip) 

♦ Ensure the update can be implemented without impacting the lookup very much 

We have met these objectives. The verilog code was written right down to the functional block 
level (i.e. just above the gate level). For example we implemented each CAM block with all the 
associated periphery logic. We could not figure out how to do iterative wiring and vectorized 
instantiations in verilog, and since we wanted to implement the model with correct functionality 
we took recourse to Perl to do the needful. 

The controller was implemented mostly procedurally. While lookup was properly implemented, 
update is much more difficult. So we just implemented a simple update algorithm that cannot 
take care of spills to adjacent bins. However, we did write a Perl script that took the 
lookup_table and created the initialization file with correct bin partitions. 

Since we have a three hash structure we need to worry about three kinds of hash overflows. For 
overflows in the address space hashes the update algorithm is O(Number of Bins~2000); for 
overflows in prefix length hashes the update algorithm is O(Number of entries~200K). We 
devised the architecture in such a way that any CAM and the affected CCMs can be written in 
the same clock cycle without destroying the data structure integrity. Thus we can continue doing 
lookups, using only the idle cycles to do the updates. For address space hash overflows this will 
entail a penalty of O(100), which is quite small and can be slipped in without holding the lookup at 
all. However, a prefix length hash overflow can in the worst case hold up lookup for 
milliseconds. The saving grace is that these updates (i.e. changes in distribution of routing prefix 
length) are likely to be very infrequent, O(months), and thus clever algorithms will keep this 
penalty down to a minimum. 

4.2 Setup of the circuit critical paths 

From the figure of the overall chip operation in section 2.0, there are 6 cycles per chip. 

The first and last cycles are I/O. The second and third cycles are CCM operations and the fourth 
one is the CAM bank operation. The fifth one is the chip priority encoder. The CCM operations 
and the CAM operation are very similar to each other and both seem to be in the critical path. So 
both of these circuits should be simulated. The priority encoder is a static logic evaluation and 
should not be a problem in terms of timing, but since it is an important function of the chip, it 
was simulated. 

The placement of the wires for the units is shown in fig. 4. 

5.0 Circuit Approach and Simulation Method 

As we discussed in the previous sections, we have 2 basic Content Comparable 
Memories: the usual CAM and the CCM. For the usual CAM we used the series matchline 
structure (or the NAND match chain). As mentioned before this was used to save power. Since 
we have a large number of CCMs also (2018+ of them), we had to come up with a compact 
structure for this memory. We observed that for comparing our IP address with the CCM 
content, we can subtract these 2 numbers and see if the result is a negative number or not. In 
logic terms, this means that we have to 2's complement one of our numbers and add them 
together. If the overall addition result is non-negative (i.e. the extra bit for 2's complement is 1) 
there would be a carry generated, otherwise there would be no carry. We used a carry-chain 
architecture to implement our CCM. 

Figure 4 : Placement of the wires for the unit [FIGURE DROPPED] 

It is not desirable to do a 2's complement operation on the IP number for each lookup. 
One solution is doing the 2's complement operation on the CCM content when it is stored during 
an update. Another solution is storing the original CCM content, but doing the carry chain logic 
operations on the inverse of the stored value. In this case there should be a carry input to the 
carry chain. Since the 2's complement of a binary number is equal to the bitwise inverse of that 
number plus 1, the end result will be the same as the first solution. 

Effectively the CCM content is subtracted from the IP number each time and a carry is 
generated if the IP number is Greater than or Equal to the CCM content, hence the name G.E.CAM. 
Two possible implementations are shown in Fig. 5. Figs. 5(a) and 5(b) correspond to the first and 
second solutions respectively. In both cases transistor M1 can be connected either to Vdd (done 
in the Fig. 5(a) implementation) or to the bitline (Fig. 5(b)). Connection to the bitline may make 
the overall cell size smaller. Of course there could be other implementations for generating the 
inputs to the series and parallel carry chain transistors. 
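
The second solution, a carry chain operating on the inverse of the stored value with a carry input of 1, can be checked with a bit-level software model. This sketch models only the logic, not the transistor circuit of Fig. 5; the function name and the generate/propagate formulation are assumptions made for illustration.

```python
def ge_by_carry_chain(incoming: int, stored: int, width: int = 32) -> bool:
    """Return True iff incoming >= stored, using only the carry out of
    incoming + (bitwise inverse of stored) + 1."""
    carry = 1                                      # carry input supplies the "+1"
    for i in range(width):                         # ripple from LSB to MSB
        a_bit = (incoming >> i) & 1
        nb_bit = ((stored >> i) & 1) ^ 1           # inverse of the stored bit
        generate = a_bit & nb_bit                  # this bit creates a carry
        propagate = a_bit ^ nb_bit                 # this bit passes a carry along
        carry = generate | (propagate & carry)
    return carry == 1                              # carry out means incoming >= stored

# Exhaustive check on a narrow word, then a spot check at 21 bits.
all_ok = all(ge_by_carry_chain(a, b, width=4) == (a >= b)
             for a in range(16) for b in range(16))
```

The final carry is set exactly when the subtraction does not go negative, so equality also produces a carry, matching the greater than or equal behavior described for the CCM.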

Figure 5 - CCM carry chain 



We used a one phase clocking scheme. Each section has one whole clock cycle to perform 
its operation. Static logic sections (like the decoder and Priority Encoder) have the whole cycle 
time to evaluate and so their timing is not critical. For precharge logic sections (like the CCM 
and CAM), the clock low cycle is used for precharge and clock high is used for evaluation. 

To save power our bitlines are not precharged (this will cut down the power by half). 
With this scheme it seems that there is no guarantee that all the nodes in the matchline chain are 
precharged. To eliminate the potential charge sharing problem, we send the bitline data during 
the precharge period so that the matchline nodes which are going to be connected to the 
matchline outputs are precharged correctly. 

As was discussed in the previous section and also from the above paragraph, we have two 
close critical paths, the CCM operation and the CAM bank evaluation. Both were simulated in 
spice. For CCM, the whole path was simulated, from the output of the previous stage flip flop to 
the input of the next stage flip flop. Since it was observed that the number of addresses with 
more than 21 bit prefix lengths is very small (0.1% from statistics), we decided, without loss of 
generality, to allocate a few bins for more than 21 bit prefix lengths and then use 21 bit CAM's 
and CCM's for storing the rest of the IP numbers. Since the word length for all CCM's in this 
implementation was 21 bit, we did the simulation for this word length. 

For CAM, the simulation path was from the output of the previous stage to the input of the sense 
amplifier (one row of data). The delay of the sense amp was then estimated and added to the 
overall delay number. Since the longest word length of the CAM cells is 32 bit, we did our 
simulation for a 32 bit long word. 

For inputs we assumed that they all have 100 ps rise and fall time (the FO4 rise time in the TTSS 
corner, in which we are measuring the speed of the circuitry). For the output of each section we used 
appropriate loading. 

All the wire capacitances and resistances were included in the simulation. These were 
extracted from the general wiring scheme discussed in previous sections. For capacitances the worst 
value was considered, that is, it was assumed that the metal layer is sandwiched between top and 
bottom metal layers. For all cases the worst case delay was simulated, i.e. it was assumed that the 
signal should travel the whole length of wire, although for some loads it had to travel shorter 
distances. 

Since we simulated one complete row in CAM, for much of the logic circuitry, like the 
SRAM read/write circuitry, the correct load and fanout are already there in the circuit. For signals 
and controls that go to all the rows (like bitlines, enable signals and clock signals), the necessary 
fanout was simulated by using dummy gates and capacitive loads. 

For measuring the critical path delay, all the initial conditions were set to the opposite of 
the final values during the cycle we are simulating. This way all the precharge and evaluation 
times are taken into account correctly. 

We also did the simulation for the write operation in the CAM banks (including the 
decoding) and Priority Encoder, although from initial hand calculations it was obvious that the 
timing of these circuits is not an issue. 

5.1 Circuit Design and Critical Path Simulation 
5.1.1 CCM Simulation 

The circuit schematic for each cell is shown in Fig. 6. Since there is no read operation 
from the CCM cell, there is no need to have more than minimum size NMOS's in the SRAM 
cell. For the logical operations (bitline·D) and (bitline+D), transmission gate logic was used to 
save area. The GE line (equivalent of the matchline in a normal CAM) transistors were chosen to be 
16λ in size in order to decrease the delay. The GE lines are implemented with minimum width 
(for lower capacitance) M1, since the distances are not that far. The wire capacitance and 
resistance were also included in the unit cell. Wire capacitance was assumed to be Cgnd + 
2*Cadj, because during the evaluation half cycle all the nearby lines are silent. 






Figure 6 - CCM unit cell [FIGURE DROPPED]

The bitline circuitry is shown in Fig. 7. The bitline inverter was chosen to be minimum 
size because the capacitance on this bitline is only equivalent to 62λ on bitline and [illegible] on 
bitline_bar. Eight of those cells are connected to each other to form a CCM 8 bit block. The 
wordline wire load for this 8 bit section is shown in Fig. 8. 

Fig. 7 - The CCM cell with bitline circuitry [FIGURE DROPPED] 
Fig. 8 - The wordline wire load for 8 bit section of CCM [FIGURE DROPPED] 

Connecting all those 21 bits (three 8 bit sections) will result in a large delay (around 10 ns). 
So matchline buffering is needed to reduce this time. Buffers are put between each 8 bit section, 
with a total of 2 buffer sets (2 inverters for each set). The buffers between 2 sections are shown 
in Fig. 9. 

Figure 9 - 8-bit blocks with buffers between them [FIGURE DROPPED] 

Now the question is whether we should precharge our GE lines low and then charge them 
high for a GE (Greater than or Equal to), or do it the other way around: precharge them 
high and then discharge them when a GE occurs. If we had no intermediate buffers (only a final 
buffer), then maybe precharging GE lines low and then charging them high makes sense. That 
way the lines initially go up very fast (because transistors are on) but reach the final value very 
slowly. 

But this option is not good if we have buffers in between. In a test SPICE simulation, even 
though Domino gates were used for buffering, the delay did not improve from the no buffer case 
by that much. The problem is that even if the buffers are skewed in such a way that they switch 
at a low threshold value, their input signal will never be very strong and so the buffer stages have 
a long delay. So because of these concerns the lines were precharged high and then discharged 
in case of a GE. The precharge transistor is also shown in Fig. 9. The precharge is done through 
an NMOS transistor. The reason is that we do not want to charge our GE line more than Vdd - Vt. 
Charging the lines more than that will turn off the GE line transistors even more and will slow 
down the GE line. 

Each buffer set consists of two inverters. At first glance it seems that the first inverter 
should be skewed such that it switches at a high threshold voltage. But the fact is that the GE line 
never charges up more than Vdd - 2Vt, which can be very low (around 2 volts for Vdd ~ 3 volts). So 
the first inverter should actually be skewed such that the threshold level is lower than that of a normal 
inverter to provide enough noise margin. By doing a set of simulation sweeps, the W_PMOS/W_NMOS 
ratio was chosen to be 0.2 to provide a switching point of around 1 volt (a noise margin of about 1 
volt). The second inverter was skewed the same way but this time for speeding up the circuit. 
The sizes were chosen to be W_NMOS = 20λ and W_PMOS = 4λ for both inverters. For the inverter at 
the end of the GE line the same sizes were used. These were obtained by optimizing the RC line 
delay model. 

Figure 10 - The output logic for CCM [FIGURE DROPPED] 
The output logic for CCM is shown in figure 10. The output of this CCM should be sent 
to the logic of the previous CAM bank's CCM. The wire model is shown in the figure. The 
output is buffered before sending it to the wire to minimize delay. The buffers are also sized for 
best delay. The buffers in the other path are added intentionally to equalize the delay of both 
paths as much as possible. If this is not done, then there is a possibility of creating glitches in the 
enable output (when the GE line is discharged for both this CCM and the next one, but the next one 
arrives later because of wire delay). These unwanted outputs actually slow down the circuit. 
With the sizes given, there was no unnecessary transition on the Enable output. The margin was so 
large that even if one of the paths becomes slightly faster, the Enable glitch will be very small. 

As we see in Fig. 10 there is a 'last' signal in the logic. If the 'last' signal is 1 in the next 
CAM bank, then the enable output will only depend on the GE from the current CCM. This is 
explained in the architecture section. This 'last' signal is stored in the MSB of the mask bits, since the 8 
MSB's are not used in masking (there are no prefix lengths less than 8). 

The clock circuitry is shown in Fig. 11. The wire loading is also included in the model 
(for the wire which should go through the height of the CCM cell). 



The delay was measured for the worst case GE operation, i.e. the carry is generated in the 
LSB and should propagate all the way to the end. The delay from input to enable 
output was measured to be 2.61 ns. This was measured in the TTSS corner. 
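
The worst case described above (a carry generated at the LSB that ripples to the far end) can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit itself: it assumes the GE result is obtained as the carry-out of `a + ~d + 1`, with per-bit generate (AND) and propagate (OR) terms in the spirit of the (bitline·D) and (bitline+D) operations; the function name `ge_carry_chain` and the choice of complementing the stored word are hypothetical.

```python
def ge_carry_chain(a: int, d: int, width: int = 21) -> bool:
    """Behavioral model of a ripple 'greater than or equal' chain.

    Assumption: a >= d is evaluated as the carry-out of a + (~d) + 1.
    The carry-in of 1 makes the equal case count as GE.
    """
    carry = True  # carry-in of 1
    for i in range(width):  # LSB first, so a carry may ripple the full width
        a_i = (a >> i) & 1
        nd_i = ((~d) >> i) & 1       # complemented stored bit (assumed)
        generate = a_i & nd_i        # AND term
        propagate = a_i | nd_i       # OR term
        carry = bool(generate | (propagate & carry))
    return carry


# Example: a 21 bit comparison of an input value against a stored value
print(ge_carry_chain(0x12345, 0x12344))  # input is larger than the stored word
print(ge_carry_chain(0x12344, 0x12345))  # input is smaller than the stored word
```

In this model the slowest case is the one where only the lowest bit decides the result and every higher bit merely propagates, which matches the worst case path that was simulated.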

Since the threshold voltage of the first inverter in each buffer set was set to be low (by 
weakening the PMOS) we checked the SFSS corner. The circuit worked properly. 

The input capacitance for clock is equivalent to [illegible]. The input capacitance for bitline is [value missing]. 

Figure 11 - Clock circuitry for CCM [FIGURE DROPPED] 

5.1.2 Normal CAM Simulation 

Many issues in CAM design are similar to the CCM's issues and so are already 
discussed. The basic CAM cell is shown in Fig. 12. As explained before we have both 32 bit and 
21 bit CAM banks. Since the 32 bit one is more critical, we will simulate this one. 

Fig. 12 - Basic CAM cell [FIGURE DROPPED] 

The matchline wire model is the same as the GE line in CCM. Choosing the NMOS 
transistor size as minimum size (4λ) helps minimize the area, especially since the CAM cell is 
covering most of the chip area. The matchline transistor and the transmission gates are also 
chosen to be [illegible]. These are made as small as possible while meeting the 7 ns cycle time objective. The 
loading of the other 99 rows on the bitline and also the bitline wire capacitance and resistance are 
modeled as shown in Fig. 13. For wire capacitance the formula Ctotal = Cgnd + 2*Cadj was still 
used (worst case assuming M2 below and M1 above). Since the two bitlines are switching in 
opposite directions, it seems that Ctotal = Cgnd + 4*Cadj should be used for wire capacitance. 
But since these bitlines are around 25λ away from each other, Cadj is pretty small and can be 
ignored. 
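
The two wire capacitance cases discussed above can be written out as a short calculation. This is a sketch only: the function name `wire_capacitance` and the numeric values are hypothetical, and the factor of 4 for neighbours switching in the opposite direction is the usual Miller doubling of each of the two adjacent-wire terms, taken here as an assumption.

```python
def wire_capacitance(c_gnd: float, c_adj: float, neighbours: str = "quiet") -> float:
    """Effective wire capacitance with two adjacent wires.

    neighbours = "quiet":    Ctotal = Cgnd + 2*Cadj (adjacent lines are silent)
    neighbours = "opposite": Ctotal = Cgnd + 4*Cadj (adjacent lines switch the
                             other way; Miller factor of 2 assumed)
    """
    factor = {"quiet": 2.0, "opposite": 4.0}[neighbours]
    return c_gnd + factor * c_adj


# Hypothetical values in fF, chosen only to show the trend
c_gnd, c_adj = 100.0, 1.0
quiet = wire_capacitance(c_gnd, c_adj, "quiet")
opposite = wire_capacitance(c_gnd, c_adj, "opposite")
print(quiet, opposite)
```

When Cadj is small compared with Cgnd, as argued for widely spaced bitlines, the two results differ by only a few percent, which is why the adjacent term can be ignored.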

Fig. 13 - Modeling the loading on the bitlines [FIGURE DROPPED] 



The bitline input circuitry is shown in Fig. 14. The mask implementation is also shown in this 
figure. Only 21 bits of the CAM have this implementation and for the other 8 (MSB), the mask bit is 
always 1. The mask bits are stored in an extra row of SRAM cells. When the 'mask' bit is 0, 
both bitlines will be charged high. This way the input of the matchline transistor will 
always be high, as desired for the masked bits. 

Figure 14 - CAM Bitline input circuitry [FIGURE DROPPED] 
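
The effect of the mask bits on a lookup can be modeled in a few lines. This is a behavioral sketch, not the circuit: a mask bit of 0 makes that bit position always match, as described for Fig. 14. The helper names `prefix_mask` and `cam_row_matches` and the example addresses are hypothetical.

```python
def prefix_mask(prefix_len: int, width: int = 32) -> int:
    """Mask with 1s in the top prefix_len bits (compared) and 0s below (ignored)."""
    return ((1 << prefix_len) - 1) << (width - prefix_len)


def cam_row_matches(key: int, stored: int, mask: int, width: int = 32) -> bool:
    """A row matches when every unmasked bit of key equals the stored bit."""
    care = mask & ((1 << width) - 1)
    return ((key ^ stored) & care) == 0


# Example: stored prefix 10.1.0.0/16
stored = (10 << 24) | (1 << 16)
mask = prefix_mask(16)
print(cam_row_matches((10 << 24) | (1 << 16) | (2 << 8) | 3, stored, mask))  # inside the prefix
print(cam_row_matches((10 << 24) | (2 << 16), stored, mask))                 # outside the prefix
```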
For designing the drivers we did not want to minimize the delay of the drivers, because 
they are operating during the precharge time and so their timing is not critical. Our only 
constraints were having an input capacitance of 100λ (the acceptable output capacitance for the 
FF we designed in HW [illegible]) and also performing the required logic. We also tried to equalize the 
delay of the two paths. The delay of the bitline path is almost 7 FO4 and the bitline_bar is 
around 7.8 FO4 (1.4 ns). This delay time is acceptable considering our cycle time. 

For driving the CAM bitline switches, the circuit shown in Fig. 15 is used. When the 
Update signal is high it means that we want to write into the CAM and so the switches are turned 
on. Likewise when the Enable signal is high it means we want to do a lookup operation and so again 
the bitline switch is turned on. When the Enable signal is low we do not write anything to the bitline 
to save power. The output of this logic sees the capacitance of a wire the length of the CAM 
bank width. The delays of the 2 paths are equalized. The 'Swcam' delay is around 3.7 FO4 and 
the 'Swbcam' delay is around 3.9 FO4. The input capacitances were assumed to be 100λ 
for both 'Update' and 'Enable' signals to be within the range of load for the FF. 



Figure 15 - CAM bitline switches circuitry [FIGURE DROPPED] 



For speeding up the CAM operation, we divided the 32 bit CAM into two 16 bit CAM blocks 
and then NORed the output results. Since our SRAM cells are shorter in height than the CAM 
cells, there is enough space below each SRAM row for an extra pitch of wire. So as shown 
before, the two 16 bit CAM banks are placed on the two sides of the SRAM bank and the output of 
one of them is passed below the SRAM cells to the other side. The output logic of the CAM 
blocks is shown in Fig. 16. Each 16 bit block is also divided into two 8 bit blocks and buffers are 
used between these 2 blocks, the same way as in the CCM design. 

Figure 16 - CAM output logic [FIGURE DROPPED] 

Since we do not want to activate SRAM wordlines when the CAM bank is not active, we 
included the 'enable_bar' signal in the NOR operation. This way when 'enable' is high, the circuit 
operates and when 'enable' is low, the SRAM wordline remains low. Also the wordline of the 
CAM is NORed in the second NOR gate for the write operation in the SRAM during an update. 

For the first NOR, the sizes were chosen so that the threshold voltage is low enough to 
avoid the problem discussed in the previous section. The apparent effective PMOS to NMOS ratio 
here is 12/(4/3) = 9, which is more than the 5 which was assumed for the inverter buffer gates. The reason 
is that when, for example, GE1 is still high (around 2 volts), 'enable_bar' and GE2 can both be 
low, which turns on the 2 PMOS transistors and effectively decreases the PMOS chain resistance. 

Since the logical effort of the first NOR is very high (around 8), the next NOR gate is 
chosen to be minimum size. The overall LE from the input of the second NOR to the wordline is around 
55/20*5/3 = 4.6. The gates are sized to get minimum delay. The overall delay of the logic is 
estimated to be around 5 FO4, of which more than 3 FO4 is from the first NAND. 

In Fig. 16 the logic for generating 'enable_bar' is also shown. At the input, the clk signal is 
NANDed with the 'enable' signal so that when clk is low (precharge state), the 'enable_bar' signal is 
kept high and so the SRAM wordline is not activated erroneously. The delay of this chain is not 
critical as long as it is shorter than the delay of the matchline. The worst case wire loading (the 
full height of the CAM bank) and also the fanout of all the skewed NOR gates are also modeled. 
This delay is around 3 FO4, which is less than the delay of the GE lines and so it is ok. 

Figure 17 - Basic SRAM cell [FIGURE DROPPED] 
The basic SRAM cell is shown in Fig. 17. The NMOS size of the cell ([illegible]) was chosen 
such that during a read operation, the low voltage on the SRAM output does not increase more 
than 500 mV. This gives enough margin such that even if the pass gate transistor becomes 
stronger due to mismatch and process variations, the read operation is still done without 
destroying the SRAM content. 

The modeling of the bitline wire load and also the loading of the other rows is done the 
same way as for the CAM cell and is shown in Fig. 18. Precharge transistors are added for the read 
operation. 

Figure 18 - SRAM with wire and load from other rows [FIGURE DROPPED] 



The SRAM read and write circuitry is shown in Fig. 19. 

Figure 19 - SRAM read and write circuitry [FIGURE DROPPED] 

For the read circuitry, we have 2 switches which open the bitlines to the input of the Sense 
Amplifier (not shown). The 30 fF capacitors are modeling the sense amp input capacitors. 

For designing the write drivers, again like the CAM write circuitry, we did not have to 
minimize the delay. Since the write circuitry is static, it has one whole cycle to complete its 
operation. Our only constraint was the bitline input capacitance (60λ in this case). The delay of the 
fork is equalized. The delay of bitline is around 3.7 FO4 and the bitline_bar around 3.2 FO4 from 
hand calculations. 

The SRAM read and write switches should be driven. The driving circuits are shown in 
Fig. 20. One is driven by 'Enable' during a read operation and the other one is driven by 
'Update' for a write operation. The outputs of this logic see the capacitance of a wire the 
length of the SRAM bank width (plus the switch transistors' input capacitances). The delays of the 
2 paths are equalized. The 'Swram' delay is around 2.9 FO4 and the 'Swbram' delay is around 
2.7 FO4. For 'Swramw' and 'Swbramw' this delay is almost 1.9 FO4. 

Figure 20 - SRAM read and write switches driving circuitry [FIGURE DROPPED] 

To further save power during the period in which the CAM bank is inactive (not enabled), 
the clock is also shut off when 'enable' is low. The circuit shown in Fig. 21 is used for both 
driving the clock load and shutting off the clock. When 'enable' is low, 'clk' is kept high and 
'clkb' is kept low and so there will not be any precharge. 

Figure 21 - Clock drive circuitry [FIGURE DROPPED] 

The output of 'clk' sees the capacitance of a wire the length of the SRAM bank width 
and the SRAM bitline precharge transistors. The delay for the 'clk' signal is around 1 FO4. 

The 'clkb' signal output sees the load from a wire which runs the full height of the CAM 
bank. It also sees 400 precharge transistors, out of which 4 transistors are present in the circuit. 
The other 396 are modeled by a dummy transistor load. The capacitor load of this dummy 
transistor is chosen to be very small because in the actual case also only one of the matchlines 
should be precharged and the others should be at their high value. The delay for the 'clkb' signal is 
around 1 FO4. 

The delay of the lookup was measured by one of these 2 methods: (1) the delay from the 
rising edge of the clock to the rising of the SRAM wordline was measured and then 0.65 ns was 
added for the Sense Amp read operation, or (2) the delay from the rising edge of the clock to the 
point where the SRAM bitline output changes by 250 mV was measured. These numbers (0.65 ns 
delay or 250 mV) were both extracted from EE313 notes. Both methods resulted in the same 
delay value. The overall evaluation delay of the lookup operation is around 3.5 ns from the clock 
edge (in the TTSS corner). But this delay from the falling edge of the 'clkb' signal is only 2.65 ns. So 
if the FF for this stage is driven by this delayed clock, we can borrow some time from the next 
clock cycle, as shown in Fig. 22. Since the next operation is the Priority Encoder and that operation 
is fully static, it can afford to lend some time to the previous operation. 

Figure 22 - Cycle Borrowing (waveforms: Original Clock, CAM Clock; stages: CAM, PE) [FIGURE DROPPED] 

The 2.65 ns operation allows us to operate at a 7 ns cycle time (considering 1.5 FO4 clock 
skew, 2 FO4 FF delay and 15% cycle time). 

The input capacitances for this stage are as follows: Clock = 196λ = 78 fF, 
Enable = 250λ = 100 fF, Update = 235λ = 91 fF. 

5.1.3 Decoder design and Memory write timing 

The decoding is done in 2 cycles. First an 8 bit decoding is done to choose one of the 256 
banks on the chip (the structure is shown earlier in the report). In the next cycle, a 7 bit decoding 
is done to choose the row of the CAM/SRAM bank. In the same cycle the data is written in the 
CAM and SRAM. So this second decoding is more critical than the first one. The 7 bit address 
space is divided into 3 bit and 4 bit address spaces. So we predecode on 3 bits, then predecode on 
4 bits and then combine them. The 3 bit predecode is faster than the 4 bit one, so the circuit for the 
4 bit predecode is designed. The circuit is shown in Fig. 23. 




Figure 23 - 4:16 Decoder [FIGURE DROPPED] 

First a 2:4 decoding is done. Then a 4:16 decoding is performed. The 16 resulting lines 
are then run through the whole height of the CAM/SRAM bank to do the 16×8=128 decoding. 
The equivalent wire capacitance is 262 fF, which is used in the simulation. After the final 
decoding stage, the global wordline is driven. The capacitance of this line (almost 1 cm long) is 
around 2 pF. The resistance of this line is around 125 ohms, which was ignored. Considering this 
resistance in our sizing calculations causes the buffer stage driving the global wordline to become 
very large, which is unnecessary, since the decoding time is not that critical. The wire resistance 
adds around 2-3 FO4 delay to the total delay of 10 FO4, which is well within the acceptable range. 
The simulated decoding time is 2 ns and the overall write time is measured as 2.53 ns. So the writing 






operation is not a problem and can be done well within the intended cycle time of 7 ns. 
(Remember that both the decoding and write operations are static operations and so can be done 
during the whole cycle.) 
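
The predecoding structure described for the 7 bit row address (a 3 bit predecode, a 4 bit predecode built from 2:4 decoders, and a final 16×8 combination) can be sketched functionally as follows. This is a behavioral illustration under assumptions, not the gate-level design: the function names are hypothetical, and the choice of the upper 3 bits for the 8-way predecode and the lower 4 bits for the 4:16 decoder is assumed.

```python
def one_hot(value: int, bits: int) -> list[int]:
    """One-hot decode of a small field: exactly one of 2**bits lines is 1."""
    return [int(i == value) for i in range(1 << bits)]


def decode_4_to_16(nibble: int) -> list[int]:
    """4:16 decode built from two 2:4 decodes that are ANDed together."""
    high = one_hot(nibble >> 2, 2)
    low = one_hot(nibble & 0b11, 2)
    return [h & l for h in high for l in low]


def decode_row(address: int) -> list[int]:
    """7 bit row decode: 8 predecoded lines ANDed with 16 lines gives 128 wordlines."""
    upper = one_hot(address >> 4, 3)        # 3 bit predecode (assumed upper bits)
    lower = decode_4_to_16(address & 0xF)   # 4 bit predecode (assumed lower bits)
    return [u & l for u in upper for l in lower]


wordlines = decode_row(99)
print(wordlines.index(1), sum(wordlines))
```

Splitting the decode this way keeps every gate at a small fan-in; only the 16 predecoded lines need to run the full height of the bank.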

Since the write operation depends on the strength of the cell PMOS and the pass gate 
NMOS (NMOS is stronger), in the SFSS corner the write operation may fail. This corner was checked 
and the write operation was done correctly. 
The address input capacitance is 9.6 fF. 

5.1.4 Priority Encoder 

Figure 24 - Priority Encoder Tree Structure [FIGURE DROPPED] 

Our priority encoder encodes priority among the CAM banks. Since these are distributed around 
the chip there are likely to be very long wires somewhere in the decoder. Following the principle 
of putting the wire capacitances as far along the gate chain as possible, all the gates are nearly 
always dominated by wire capacitances (except for certain gates with huge fanouts). Also since 
we have only one priority encoder, area is not really an issue. So, we went in for a distributed 
static tree PE. This also allowed this pipeline stage to lend some time to the CAM search stage, 
increasing our lookup speed further. For simplicity of design as well as power considerations we 
went in for a two stage 16x16 (256) decoder as shown in the floorplan. Between the two 
decoders we had 10 AND gate stages, so the available time was divided equally between these 
gates. Taking a 7 ns cycle time, giving 0.8 ns to the CAM stage and accounting for flipflop delays, 
clock skew and uneven clock duty cycle, we have to meet a total delay spec of 5.4 ns. Since we 
wanted to be comfortably inside this spec without wasting area and power on large fast gates, 
we chose 3.5 ns as the target. 
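
The two stage 16x16 organization can be illustrated with a short functional model. This is a sketch under assumptions rather than the gate-level tree: it assumes that the lowest-numbered bank has the highest priority, and the names `first_set` and `priority_encode_256` are hypothetical.

```python
from typing import Optional, Sequence


def first_set(bits: Sequence[bool]) -> Optional[int]:
    """Index of the first asserted input, or None if no input is asserted."""
    for index, bit in enumerate(bits):
        if bit:
            return index
    return None


def priority_encode_256(hits: Sequence[bool]) -> Optional[int]:
    """Two stage encode: pick the winning group of 16, then the winner inside it."""
    if len(hits) != 256:
        raise ValueError("expected one hit flag per bank (256 banks)")
    groups = [hits[g * 16:(g + 1) * 16] for g in range(16)]
    group_has_hit = [any(group) for group in groups]   # first stage, 16 groups
    winner_group = first_set(group_has_hit)
    if winner_group is None:
        return None                                    # no bank matched
    return winner_group * 16 + first_set(groups[winner_group])  # second stage


hits = [False] * 256
hits[200] = True
hits[37] = True
print(priority_encode_256(hits))
```

Each stage only has to resolve 16 inputs, which is what keeps the individual gates small even though the wires between banks are long.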

A typical stage (AND gate) in the PE looks as shown in figure 25. 




Figure 25 - Typical stage in the PE (AND gate) [FIGURE DROPPED] 



To get a total delay of 350 ps ([illegible]) per stage, writing the delay equations we get a fanout of 9.1 
across the whole gate. The wire capacitances and gate load were calculated from figure 24 by 
tracing the positioning of each gate relative to the layout on the chip. The critical paths are 
shown in figure 24. The green path is the critical path for the first stage and the red is that for 
the second stage. Simulations at the TTSS corner gave a delay of 3.8 ns including all the wire 
capacitances, meeting the spec comfortably. 



5.2 Power Estimate 

The main sources of power consumption are as follows: 
1 - Bitlines of CCMs and CAM (and SRAMs) and other lines inside them. 
2 - Global wires inside the chip, mainly the IP number bus. In this category there are also 
global word lines but they are only active during the update. Since the update frequency 
is a very small fraction of the lookup frequency, we can ignore this power. 
3 - Power consumed in the logic, like in the Priority Encoder and CAM logic. The same as 
above, we can ignore the power burnt in the decoder. 

We simulated the CAMs, CCM and Priority Encoders at the FFFF corner and integrated 
the current through Vdd to obtain the power. Here are the results: 

1 - For a 32 bit, 100 entry CAM bank, we obtained the average energy per cycle to be 0.15 
nJ and the average energy per cycle when all the bitlines switch to be 0.83 nJ. So 
0.68 nJ is due to bitlines. Since at any time half the bitlines switch on average, 
bitlines consume about 0.34 nJ per cycle. So the average consumption per cycle for a 32 
bit CAM bank is 0.49 nJ. At each cycle, 25 CAM banks for only one chip are 
activated. So the total energy consumed in CAM banks (bitlines and logic) will be 
12.25 nJ per cycle. 

2 - For a 32 bit CCM, the energy consumed per cycle is [value illegible] pJ for all bitlines switching. So 
the average energy consumed is almost half of this value, which is 12 pJ. Since in 
each cycle 256 of these CCMs activate in one chip, they will consume 3.1 nJ of 
energy. 

3 - From simulation, the Priority Encoder was consuming 0.16 nJ per cycle. This was for the 
critical path of the PE. Assuming a factor of 2 for the whole encoder and considering 
that we have 17 of these PEs on each chip, the overall energy consumption by the PEs 
will be 5.4 nJ per cycle. 

4 - Now we should calculate the energy for charging up the IP number buses inside the 
activated chip. From the chip floor plan we have 5 of these buses. These buses are 1 cm 
long and so each line has a capacitance of 2 pF. So the total capacitance is 320 pF. 
Since on average half of these lines switch, the energy per cycle will be 1.75 nJ. If a 
match is found, we only activate one port address bus. Since the port address bus is 
only 5 bit long, its contribution to energy consumption is negligible. 

If we sum up these numbers, the overall energy consumption per cycle (for all chips) will 
be 22.5 nJ per cycle. For a 50 MHz clock frequency (20 ns cycle time) this is equivalent to 1.125 
W overall or 1125/8 = 141 mW per chip. For a 7 ns cycle time (142 MHz clock), this power is 
400 mW per chip. 
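
The arithmetic of this estimate can be reproduced with a short script. It is a sketch that only re-adds the per-block figures quoted in this section; the function names are hypothetical, and the division by 8 follows the 1125/8 step in the text.

```python
def energy_per_cycle_nj() -> float:
    """Sum of the per-cycle energy contributions quoted in the text, in nJ."""
    cam_bank = 0.15 + (0.83 - 0.15) / 2   # logic plus half of the bitline energy
    cam_total = 25 * cam_bank             # 25 CAM banks active per cycle
    ccm_total = 256 * 0.012               # 256 CCMs at 12 pJ average each
    pe_total = 0.16 * 2 * 17              # critical path x2, 17 priority encoders
    bus_total = 1.75                      # IP number buses, half the lines switching
    return cam_total + ccm_total + pe_total + bus_total


def power_per_chip_mw(energy_nj: float, cycle_ns: float, chips: int = 8) -> float:
    """Average power per chip; nJ divided by ns gives watts."""
    return energy_nj / cycle_ns / chips * 1000.0


total = energy_per_cycle_nj()
print(round(total, 2))
print(round(power_per_chip_mw(22.5, 20.0), 1))  # 50 MHz clock
print(round(power_per_chip_mw(22.5, 7.0), 1))   # 7 ns cycle time
```

Because the energy per cycle is fixed, the power scales inversely with the cycle time, which is why the per-chip figure roughly triples when going from 20 ns to 7 ns.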
6.0 Concluding Remarks 

When we set out to decide the overall architecture of our design we made the following 
observations: 

♦ Hardware implementations invariably had poor algorithm design and hence burnt a lot of 
power. 

♦ Software (or processor based) designs, while having a good algorithmic design, suffered from 
the memory-processor bus bottleneck and were hence serial in nature, and also consumed a 
lot of area by storing inefficient data structures. 

Since we started with virgin silicon it did not make sense to take either of these approaches but 
to come up with an efficient parallel design combining the best of both worlds. 
Overall we think we hit all three key pins: smallest area, lowest power and highest speed 
without going in any extreme direction. Also, it should be remembered that our hashing is 
totally flexible without depending on any IP distribution statistics in the address space. 

7.0 Generalizations and Extensions 

Our hardware based search and preclassification ideas are not limited to the 
implementation described in this report. Throughout the report we pointed out some 
generalizations. The following list explains a few more of these generalizations and extensions. 
Description and Operation of Alternative Embodiments. 

1- We are not limited to SRAM for implementing our CAM and CCM cells. Any kind of 
memory cell including DRAM can be used as the storage element (circuitry/device). 

2- Any kind of matchline implementation can be used for the CAMs. In the current 
implementation we used series transistors in our CAM matchline (NAND-like 
matchline). Parallel transistor implementation of the matchline (NOR-like matchline) or 
any combination of the two might be used as well. 

3- In this implementation CCMs were used for doing the 'greater than or equal to' 
operation. In general CCMs can be used for any comparison operation. 

4- The word length for CAMs and CCMs does not have to be 32 bits. The same ideas 
explained in this report work for any arbitrary word length. 

5- The CAM bank size can be chosen arbitrarily (in this implementation it was 100). 

6- We had 3 levels of pre-classification in this implementation, out of which 2 
were in the address space (Fig. 1). The number of levels of pre-classification is not 
central to our idea and can be chosen as appropriate for the particular application. 

7- We are not limited to the single phase clocking scheme, as was used in this 
implementation. 

8- By providing multiple matchlines for each storage element in our CAMs, we can 
perform several lookups in parallel and further speed up our search. 



